Towards Structuring Unstructured GenBank Metadata for Enhancing Comparative Biological Studies

Authors

  • Elizabeth S. Chen
  • Indra Neil Sarkar
Abstract

Within large sequence repositories such as GenBank, there is a wealth of metadata providing contextual information that may enhance search and retrieval of relevant sequences for a range of subsequent analyses. One challenge is the use of free text in these metadata fields, where approaches are needed to extract, structure, and encode essential information. The goal of the present study was to explore the feasibility of using a combination of existing resources for annotating unstructured GenBank metadata, initially focusing on the "host" and "isolation_source" fields. This paper summarizes early results for 10 host organisms, including a characterization of associated isolation sources with respect to biomedical ontologies and semantic types. The findings from this preliminary study provide insights into the rich information captured within these unstructured metadata, offer guidance for addressing the challenges and issues encountered, and highlight the potential value for enriching comparative biological studies towards improving human health.
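As a rough illustration of the first step such work depends on, the sketch below pulls the free-text "host" and "isolation_source" values out of GenBank flat-file text and groups isolation sources by host. It is not the authors' method: the function names, the sample record, and the lowercase normalization are all assumptions made for illustration, and the regular expression is a simplification that ignores escaped quotes inside values. Mapping the extracted strings to ontologies or semantic types is outside its scope.

```python
import re
from collections import Counter

# Matches qualifiers such as /host="Homo sapiens" in a flat-file feature table.
# The negated character class also spans newlines, so wrapped values are caught.
QUALIFIER_RE = re.compile(r'/(host|isolation_source)="([^"]*)"')


def extract_qualifiers(flatfile_text):
    """Return (qualifier, value) pairs with internal whitespace collapsed."""
    pairs = []
    for name, raw in QUALIFIER_RE.findall(flatfile_text):
        # Wrapped lines carry leading indentation; collapse it to single spaces.
        pairs.append((name, " ".join(raw.split())))
    return pairs


def sources_by_host(records):
    """Map each lowercased host string to a Counter of its isolation sources."""
    grouped = {}
    for text in records:
        fields = dict(extract_qualifiers(text))
        host = fields.get("host")
        source = fields.get("isolation_source")
        if host and source:
            grouped.setdefault(host.lower(), Counter())[source.lower()] += 1
    return grouped


# Hypothetical record fragment, invented for illustration only.
SAMPLE = '''
     source          1..1200
                     /organism="Influenza A virus"
                     /host="Homo sapiens"
                     /isolation_source="nasopharyngeal swab from
                     hospitalized patient"
'''

if __name__ == "__main__":
    for host, counts in sources_by_host([SAMPLE]).items():
        print(host, dict(counts))
```

Grouping by a lowercased host string is only a placeholder for real normalization; free-text hosts such as "human" and "Homo sapiens" would still land in separate groups until they are mapped to a shared identifier.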

Similar resources

Accelerated Biological Meta-Data Generation and Indexing on the Cray XD1

The volume and diversity of biological information to be integrated in comparative bioinformatics studies continues to grow. Increasingly, the information is unstructured and without appreciable annotation necessary to make the necessary associations for comparative analysis. The FPGA and the Cray XD1 in particular provide a means to rapidly generate the dynamic metadata information needed to e...

Natural Language Processing Methods for Enhancing Geographic Metadata for Phylogeography of Zoonotic Viruses

Zoonotic viruses represent emerging or re-emerging pathogens that pose significant public health threats throughout the world. It is therefore crucial to advance current surveillance mechanisms for these viruses through outlets such as phylogeography. Despite the abundance of zoonotic viral sequence data in publicly available databases such as GenBank, phylogeographic analysis of these viruses ...

New Computational Tools for Brassica Genome Research

With the increasing quantities of Brassica genomic data being entered into the public domain and in preparation for the complete Brassica genome sequencing effort, there is a growing requirement for the structuring and detailed bioinformatic analysis of Brassica genomic information within a user-friendly database. At the Plant Biotechnology Centre, Melbourne, Australia, we have developed a seri...

Personalized Structuring of Retrieved Items

People nowadays struggle with huge unstructured collections containing some important knowledge. Search engines were developed to allow an easy access to relevant information, however they themselves produce for many queries large (unstructured) result sets. We suggest to tackle this problem by structuring these result lists or even complete collections in a personalized way, i.e. as the user w...

Knowledge-driven geospatial location resolution for phylogeographic models of virus migration

Diseases caused by zoonotic viruses (viruses transmittable between humans and animals) are a major threat to public health throughout the world. By studying virus migration and mutation patterns, the field of phylogeography provides a valuable tool for improving their surveillance. A key component in phylogeographic analysis of zoonotic viruses involves identifying the specific locat...

Journal title:

Volume 2011, Issue

Pages -

Publication date: 2011